INTRODUCTION

The major libraries we use are GGally, to plot correlations between many variables at once; ggplot2, to plot graphs; and dplyr, for other exploratory data analysis functions.

Let’s have a look at a small chunk of the wine dataset.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

Quality should be made ordinal, since it is a rating rather than a continuous measurement.
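A minimal sketch of that conversion in base R. The data frame name `wine` is an assumption, the six scores are copied from the preview above, and the level range 3 to 8 is taken from the scores that appear in the column summary further down.

```r
# First six quality scores, copied from the data preview above.
# `wine` is a hypothetical name for the data frame.
wine <- data.frame(quality = c(5, 5, 5, 6, 5, 5))

# Convert to an ordered factor so that 3 < 4 < ... < 8 is respected.
# The level range 3:8 is an assumption based on the summary output.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine$quality)    # shows an ordered factor with 6 levels
table(wine$quality)  # counts per rating, including empty levels
```

Because the factor is ordered, comparisons such as `wine$quality >= "6"` work as expected.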

Number of wines

## [1] 1599

Summary of each column

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 8.40   3: 10  
##  1st Qu.: 9.50   4: 53  
##  Median :10.20   5:681  
##  Mean   :10.42   6:638  
##  3rd Qu.:11.10   7:199  
##  Max.   :14.90   8: 18

Univariate Plots

Histogram of fixed acidity of wine

We find that most wines have a fixed acidity level between 6 and 10, and the distribution is approximately normal.

Histogram of residual sugar in wine

Most wines fall within a narrow range of residual sugar, about 2 +/- 1.

Histogram of residual sugar in wine after removing the outliers

With the outliers removed, the same narrow range of about 2 +/- 1 is easier to see.
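The report does not say which rule was used to remove outliers. One common approach, sketched here as an assumption, is to keep only values at or below a high percentile before plotting. The values are toy numbers, not taken from the dataset.

```r
# Toy residual sugar values (not from the dataset); 15.5 plays the outlier.
sugar <- c(1.8, 1.9, 1.9, 2.0, 2.1, 2.2, 2.3, 2.6, 2.6, 15.5)

# Keep values at or below the 95th percentile.
# The 0.95 cutoff is an assumption, not the report's documented choice.
cutoff  <- quantile(sugar, 0.95)
trimmed <- sugar[sugar <= cutoff]

# Histogram of the trimmed values; the bulk of the data is now visible.
hist(trimmed, breaks = 10,
     main = "Residual sugar, outliers removed",
     xlab = "residual.sugar")
```

The same filter can be applied to chlorides or sulphates by swapping in the relevant column.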

Histogram of chlorides in wine

Chloride levels cluster around 0.08 (the middle half of the wines lie between 0.07 and 0.09), so variability is very low apart from a few outliers.

Histogram of chlorides in wine after removing outliers.

Let’s have a clearer look at the histogram

Histogram of amount of free sulfur dioxide in wine

Most wines have a free sulfur dioxide level under about 20, but many go beyond that value.

Histogram of density of the wine

The density of wine is approximately normally distributed, with values between about 0.990 and 1.004.

Histogram of pH level of the wine

Wine is a fairly acidic drink, with a pH mostly between 3 and 3.6.

Histogram of Wine quality rating

Most wines are rated 5 or 6. Good wines are rated 7, and a few rare great wines are rated 8.

Alcohol Level Histogram

The plot shows that the alcohol percentage is usually between 9 and 12. Few wines have a higher level.

Boxplot of alcohol

Sulphate level Histogram

The amount of sulphate used is usually between 0.4 and 0.8.

After removing the outliers, we find that the amount used in most wines is between 0.4 and 1.

Boxplot of Sulphates

Citric Acidity Level Histogram

The plot shows that most wines have a citric acid level of 0.5 or below.

Boxplot of Citric Acid

Volatile acid distribution

Histogram

Boxplot of Volatile Acid

UNIVARIATE PLOT ANALYSIS

What is the structure of your dataset?

There are 1599 red wines in the dataset, with the features fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and finally the main one, quality.

Changed data?

Quality is blind tested by experts and rated from 0 to 10. The data type of quality therefore had to be changed to ordinal.

Analysis

Most wines are rated between 5 and 7, and the middle half of the wines have an alcohol level between 9.5 and 11.1. The quality plot shows that most scores are 5 or 6, and many wines also score 7.

Bivariate Plots

All possible bivariate graphs

To get a wider idea of the entire dataset, we plot every field against every other field in a single summary graph. We then investigate further based on this result, focusing on features that are close to normally distributed and features that are highly correlated.
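GGally's `ggpairs()` draws the plot matrix itself. As a numeric complement, a correlation matrix can be computed with base R's `cor()`. The sketch below uses only the six rows shown in the data preview, so the values illustrate the mechanics and are not representative of the full dataset. The name `wine_head` is an assumption.

```r
# Five columns from the six preview rows shown in the introduction.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(wine_head)
print(round(cor_matrix, 2))
```

On the full data, pairs with a large absolute value in this matrix are the ones worth plotting individually.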

Alcohol vs Quality

Alcohol level histogram, mapped with wine quality

Density vs Alcohol

Sulphate vs. Quality

High sulphate vs quality

Focusing on drinks with a high amount of sulphate

Sulphate and alcohol show almost no correlation with each other. This suggests they are two largely independent factors that affect the quality of the wine.

Citric Acid vs quality

Higher citric acid and lower volatile acid

Volatile acid vs quality

Fixed acid vs quality

Density vs quality

Volatile acidity vs pH: an interesting observation

Another interesting observation that doesn’t make much sense is that volatile acidity has a positive correlation with pH. Chemically, pH rises as acidity falls, so we would expect a negative correlation. Yet according to this correlation, pH actually increases with volatile acidity. We plot this relationship and find a lot of outliers, which might be causing the apparent correlation.
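One way to check how much outliers influence a correlation is to compare the Pearson coefficient with the rank-based Spearman coefficient, and with Pearson recomputed after dropping the extreme points. The sketch below uses toy values, not the wine data, and makes no claim about what the real columns would show.

```r
# Toy values (not from the dataset): nine points with a clear negative
# trend, plus one extreme point far from the rest.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
y <- c(9, 8, 7, 6, 5, 4, 3, 2, 1, 30)

pearson_all  <- cor(x, y)                       # sensitive to the extreme point
spearman_all <- cor(x, y, method = "spearman")  # uses ranks, more robust
pearson_trim <- cor(x[-10], y[-10])             # extreme point dropped

print(c(pearson_all = pearson_all,
        spearman_all = spearman_all,
        pearson_trim = pearson_trim))
```

In this toy case the single extreme point flips the sign of the Pearson coefficient, while the Spearman coefficient stays negative.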

BIVARIATE ANALYSIS

Alcohol

We can see a clear correlation between quality and alcohol level.

Sulphates

There is a clear correlation between sulphate level and quality. However, we further observe that some drinks have a high amount of sulphates but are usually rated only average, and the best drinks do not push the level of sulphates. Subsetting that data to see how much alcohol these drinks have, we find that although the sulphate level is high, the alcohol level is low in the average-quality drinks.
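The subsetting step could look like the following base R sketch. The data frame `wine`, the threshold of 1.0 for "high" sulphate, and all values are assumptions made for illustration.

```r
# Toy rows (not from the dataset) with the three columns of interest.
wine <- data.frame(
  sulphates = c(0.56, 0.68, 1.10, 1.25, 0.62, 1.40, 0.75, 1.05),
  alcohol   = c(9.4, 9.8, 9.5, 9.2, 11.5, 9.6, 12.0, 12.5),
  quality   = c(5, 5, 5, 5, 7, 5, 7, 7)
)

# Keep only high-sulphate wines; the 1.0 cutoff is hypothetical.
high_sulphate <- subset(wine, sulphates > 1.0)

# Mean alcohol level for each quality rating within that subset.
mean_alcohol <- aggregate(alcohol ~ quality, data = high_sulphate, FUN = mean)
print(mean_alcohol)
```

With dplyr, the same summary can be written as a `filter()`, `group_by()` and `summarise()` chain.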

Volatile acid

Volatile acidity refers to the steam-distillable acids present in wine, primarily acetic acid but also lactic, formic, butyric, and propionic acids. In excess these acids give the wine an unpleasant, vinegar-like character, so the lower the volatile acidity, the better the wine quality tends to be.

Density vs Alcohol

Density has a negative correlation with alcohol: density tends to fall as alcohol content rises.

Sulphate is definitely important

Sulphate marginally improves the quality of the drink according to the plot. But there are a lot of high-sulphate drinks that are merely average and not exceptional. We investigate this in the next visualisation.

Why does sulphate matter?

Sulphates play an important role in preventing oxidization and maintaining a wine’s freshness.

Multivariate Plots

Analysing alcohol, density and their relation with quality

Analysing sulphate, alcohol and their relation with quality

Analysing drinks with high level of sulphate with alcohol and quality.

When you don’t have citric acid, you end up having volatile acid

We put both together and look for a pattern. Higher-quality drinks are shown in a lighter shade of blue. We notice that the region of low volatile acidity and high citric acid contains the better-quality drinks.
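A plot of this kind can be sketched with ggplot2 as follows. The data are toy values rather than the wine dataset, and `quality` is kept numeric here so that ggplot2's default continuous colour scale applies, which runs from dark blue for low values to light blue for high values.

```r
library(ggplot2)

# Toy rows (not from the dataset).
wine <- data.frame(
  citric.acid      = c(0.00, 0.04, 0.56, 0.45, 0.10, 0.50),
  volatile.acidity = c(0.70, 0.76, 0.28, 0.35, 0.66, 0.30),
  quality          = c(5, 5, 6, 7, 5, 7)
)

# Scatter plot of the two acids, coloured by quality.
p <- ggplot(wine, aes(x = citric.acid, y = volatile.acidity,
                      colour = quality)) +
  geom_point(size = 3) +
  labs(x = "citric.acid", y = "volatile.acidity", colour = "quality")

print(p)
```

If quality is stored as an ordered factor, a discrete palette is used instead, so the colour scale would need to be chosen explicitly.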

MULTIVARIATE ANALYSIS

Alcohol, Density and Quality

It is difficult to say whether the experts prefer the higher alcohol content or the lower density, because higher alcohol is correlated with lower density. According to the correlations, high alcohol and low density should both give a better quality of wine. But the two are related in the first place: we find that density falls as alcohol rises, so buying a high-alcohol drink usually means buying a low-density drink. Among the wines that scored 6 and 7, many also have a sulphate level higher than 0.8. Either way, higher alcohol or lower density is correlated with better quality of drinks.

Sulphates, Alcohol and Quality

We already know from the bivariate analysis that sulphate and alcohol are not related in the way alcohol and density were. So the plot suggests that high levels of both sulphates and alcohol improve the quality of the drink.

Drinks with high level of sulphates

Here we investigate drinks with a high amount of sulphate along with their alcohol level. We find that a high amount of sulphate helps make the drink good, and that alcohol then helps make it even better. Even without high sulphate, a drink can get away with a high alcohol level.

FINAL PLOTS SECTION

Citric Acid vs quality

Higher citric acid helps improve the quality of the drink significantly. And from the univariate analysis we find that most wines have a low citric acid level.

Analysing sulphate, alcohol and their relation with quality

We see a clear pattern: higher-quality drinks mostly lie in the region of higher sulphate and alcohol levels.

High alcohol means an average or better quality drink at the least.

We can observe that almost every drink with an alcohol level above 11% has a rating of 6, which is average, or above. We also notice that the share of drinks rated 8 increases gradually at alcohol levels of 12% and above.
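A claim like this can be checked numerically by grouping wines into alcohol bands and computing the share rated 6 or above in each band. The sketch below uses toy values, not the dataset, and the band boundaries at 11 and 12 are taken from the thresholds mentioned in the text.

```r
# Toy rows (not from the dataset).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.2, 10.5, 11.2, 11.5, 11.8, 12.3, 12.8, 13.0),
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 8, 6)
)

# Alcohol bands: up to 11, above 11 up to 12, and above 12.
wine$band <- cut(wine$alcohol, breaks = c(-Inf, 11, 12, Inf),
                 labels = c("<= 11", "11-12", "> 12"))

# Share of wines rated 6 or above, and share rated 8, per band.
share_6_plus <- tapply(wine$quality >= 6, wine$band, mean)
share_8      <- tapply(wine$quality == 8, wine$band, mean)

print(share_6_plus)
print(share_8)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `mean` works as the summary function here.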

At first, it seemed like there was hardly any correlation. The data didn’t make much sense. But on further investigation, subtle patterns and information were retrieved, which helped build an idea of which chemicals might help improve the quality of wine.

There is not much of a correlation between wine quality and its chemical properties. According to this analysis, there is no silver bullet formula for making a great wine from chemistry alone. But three components stand out: alcohol, citric acid and sulphate each help make the drink better.

REFLECTION

At first, it seemed like there was hardly any correlation. The data didn’t make much sense. But checking the bivariate correlations gave a better sense of the data. A combination of two or more features seemed to affect the quality of the wine.

I often got stuck trying to implement a good visualisation using the colour palette. The syntax was also pretty unusual for me, as it was my first time with R. But finding the right examples and outside resources helped me overcome these problems.

Finding other interesting patterns also helped in building a better plot. One important question that bothered me was why pH increased with volatile acidity when it was supposed to decrease. I have my own theories, but they are only opinions and I have no credible information to support them. It would be worth investigating in the future.